Simulate the Data

Single-cell RNA-seq data are simulated to represent a situation in which two groups of cells generated through some experimental procedure have heterogeneous expression in a number of genes. Both groups also possess genes that are differentially expressed relative to a group of control cells.

We will show that the two groups of cells subjected to the experimental procedure are indistinguishable when subjected to dimension reduction techniques that do not take into account the information stored in the control cells.

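# load the packages that provide the functions used below
# (assumed providers; rowVars() may come from another package in the original setup)
library(splatter)     # newSplatParams(), splatSimulate()
library(scater)       # normalize()
library(matrixStats)  # rowVars()

# simulate the three groups of cells such that cell heterogeneity is masked by
# some batch effect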
params <- newSplatParams(
  seed = 6757293,
  nGenes = 500,
  batchCells = c(150, 150),
  batch.facLoc = c(0.05, 0.05),
  batch.facScale = c(0.05, 0.05),
  group.prob = rep(1/3, 3),
  de.prob = c(0.1, 0.05, 0.1),
  de.downProb = c(0.1, 0.05, 0.1),
  de.facLoc = rep(0.2, 3),
  de.facScale = rep(0.2, 3)
)
sim_groups_sce <- splatSimulate(params, method = "groups")

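# get the logcounts of the data
# (in recent scater releases, normalize() has been replaced by logNormCounts())
sim_groups_sce <- normalize(sim_groups_sce)

# remove all genes without variation in counts (rows of the object are genes)
sim_groups_sce <- sim_groups_sce[which(rowVars(counts(sim_groups_sce)) != 0), ]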

PCA of All Cells

We take the first two principal components of the entire dataset to illustrate that the variance caused by the batch effect dominates all other signals in the data.
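
The idea can be sketched on toy data using only base R. The matrix below is a made-up stand-in for the log-counts (cells in rows, genes in columns), with a strong batch shift and a weak group shift; none of its values come from the simulation above. With the object used in this document, the input would presumably be the transposed log-count matrix.

```r
# Toy stand-in for the log-count matrix (cells in rows, genes in columns).
# All sizes and shifts are illustrative assumptions, not the simulated data.
set.seed(1)
n_cells <- 120
n_genes <- 50
batch <- rep(c("Batch1", "Batch2"), each = n_cells / 2)
group <- rep(c("Group1", "Group2", "Group3"), times = n_cells / 3)

expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)
# strong batch shift on the first 20 genes
expr[batch == "Batch2", 1:20] <- expr[batch == "Batch2", 1:20] + 3
# weak group shift on five other genes
expr[group == "Group3", 21:25] <- expr[group == "Group3", 21:25] + 0.5

# PCA with centering only; keep the first two principal components
pca_fit <- prcomp(expr, center = TRUE, scale. = FALSE)
pcs <- pca_fit$x[, 1:2]

# proportion of variance captured by each of the first two components
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
print(round(var_explained[1:2], 2))

# mean PC1 score per batch: the batches sit on opposite sides of zero
pc1_by_batch <- tapply(pcs[, 1], batch, mean)
print(pc1_by_batch)
```

In this toy setting the first component tracks the batch label almost perfectly, while the weaker group signal is not visible in the leading components.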

Applying Dimension Reduction Techniques to Target Data

Now, we focus on applying various dimension reduction techniques to the target data, i.e. the cells that were subjected to some experimental procedure. The transcriptome data belonging to the control cells is used as a background dataset for cPCA and scPCA.

PCA

cPCA

Effect of Contrastive Parameter
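
As a minimal base-R sketch of what the contrastive parameter does, cPCA can be written as an eigendecomposition of the target covariance matrix minus the contrastive parameter times the background covariance matrix. The helper `cpca_loadings()` and the toy data are hypothetical and are not the implementation used for the figures in this document.

```r
set.seed(2)
n <- 500
p <- 10

# Background: high-variance nuisance variation on genes 1-2 only
background <- matrix(rnorm(n * p), n, p)
background[, 1:2] <- background[, 1:2] * 4

# Target: the same nuisance variation plus a group shift on genes 3-4
target <- matrix(rnorm(n * p), n, p)
target[, 1:2] <- target[, 1:2] * 4
target_group <- rep(c(1, 2), each = n / 2)
target[target_group == 2, 3:4] <- target[target_group == 2, 3:4] + 4

# Hypothetical helper: leading eigenvectors of the contrastive covariance
cpca_loadings <- function(target, background, contrast, n_eigen = 2) {
  c_contrast <- cov(target) - contrast * cov(background)
  eig <- eigen(c_contrast, symmetric = TRUE)
  eig$vectors[, seq_len(n_eigen), drop = FALSE]
}

# contrast = 0 is ordinary PCA of the target data
for (contrast in c(0, 1, 10)) {
  v1 <- cpca_loadings(target, background, contrast)[, 1]
  # squared weight of the first loading vector on the signal genes 3-4
  cat("contrast =", contrast,
      " weight on signal genes =", round(sum(v1[3:4]^2), 2), "\n")
}
```

With a contrastive parameter of zero, the first loading vector follows the nuisance variation shared with the background. As the parameter grows, that variation is penalized and the loading vector moves onto the genes that separate the target groups.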

scPCA

Comparison of cPCA and scPCA Loadings Matrices

Gene DEFacGroup1 DEFacGroup3 diff scPCA1
Gene327 1.0000000 2.2944560 1.2944560 1
Gene201 2.0011872 1.0000000 1.0011872 0
Gene346 1.0000000 1.9037754 0.9037754 1
Gene240 1.0000000 1.7944498 0.7944498 0
Gene128 1.7802675 1.0000000 0.7802675 1
Gene473 1.0000000 1.7703224 0.7703224 0
Gene214 1.6983572 1.0000000 0.6983572 0
Gene188 1.0000000 1.6860079 0.6860079 1
Gene307 1.6467488 1.0000000 0.6467488 1
Gene192 1.6426308 1.0000000 0.6426308 1
Gene44 1.6113054 1.0000000 0.6113054 1
Gene454 1.0000000 1.5922447 0.5922447 1
Gene270 1.0000000 1.5701455 0.5701455 0
Gene304 1.0000000 1.5430068 0.5430068 0
Gene190 1.0000000 1.5176864 0.5176864 1
Gene383 0.4846386 1.0000000 0.5153614 1
Gene158 1.5126046 1.0000000 0.5126046 1
Gene8 1.5483328 1.0770052 0.4713276 1
Gene370 1.4712158 1.0000000 0.4712158 1
Gene364 1.4459099 1.0000000 0.4459099 1
Gene66 1.0000000 1.4211804 0.4211804 1
Gene68 1.0000000 1.3941840 0.3941840 1
Gene10 1.3941636 1.0000000 0.3941636 0
Gene54 1.0000000 1.3759716 0.3759716 1
Gene315 1.0000000 1.3691284 0.3691284 0
Gene3 1.3611369 1.0000000 0.3611369 0
Gene135 1.3587745 1.0000000 0.3587745 0
Gene334 1.0000000 0.6421265 0.3578735 0
Gene196 1.3468610 1.0000000 0.3468610 1
Gene245 1.0000000 1.3271180 0.3271180 0
Gene220 1.3079478 1.0000000 0.3079478 0
Gene342 1.0000000 1.2988780 0.2988780 0
Gene380 1.2972411 1.0000000 0.2972411 0
Gene228 1.2931549 1.0000000 0.2931549 0
Gene363 0.7128209 1.0000000 0.2871791 0
Gene229 1.2861742 1.0000000 0.2861742 0
Gene80 0.7147228 1.0000000 0.2852772 0
Gene100 1.0000000 1.2837038 0.2837038 0
Gene338 1.2741687 1.0000000 0.2741687 0
Gene275 1.0000000 1.2706184 0.2706184 0
Gene108 1.0000000 1.2704223 0.2704223 0
Gene436 0.7300889 1.0000000 0.2699111 0
Gene143 1.2692307 1.0000000 0.2692307 0
Gene254 1.2596478 1.0000000 0.2596478 0
Gene353 1.0000000 1.2574056 0.2574056 0
Gene489 1.0000000 1.2499518 0.2499518 0
Gene285 1.2458224 1.0000000 0.2458224 1
Gene103 1.2402441 1.0000000 0.2402441 0
Gene482 1.0000000 1.2327516 0.2327516 0
Gene258 1.0000000 0.7857538 0.2142462 0
Gene218 1.0000000 1.2088149 0.2088149 0
Gene458 1.0000000 1.1934206 0.1934206 0
Gene235 1.1802687 1.0000000 0.1802687 1
Gene197 1.0000000 0.8232851 0.1767149 0
Gene453 1.0000000 1.1731249 0.1731249 0
Gene36 1.0000000 0.8368966 0.1631034 0
Gene28 1.0000000 1.1586270 0.1586270 0
Gene193 1.0000000 1.1564362 0.1564362 0
Gene55 1.1393105 1.2937132 0.1544027 0
Gene302 1.0000000 1.1483239 0.1483239 0
Gene238 1.1441278 1.0000000 0.1441278 0
Gene30 1.0000000 1.1425416 0.1425416 0
Gene75 1.0000000 1.1370305 0.1370305 0
Gene11 1.0000000 1.1324617 0.1324617 0
Gene424 1.1310105 1.0000000 0.1310105 0
Gene70 1.0000000 0.8755727 0.1244273 0
Gene169 1.0000000 1.1230450 0.1230450 0
Gene405 1.1723449 1.0548861 0.1174588 0
Gene250 1.0000000 0.8854873 0.1145127 0
Gene46 1.0000000 1.1050231 0.1050231 0
Gene145 1.0000000 1.0937630 0.0937630 0
Gene374 1.0918226 1.0000000 0.0918226 0
Gene399 1.0000000 1.0807069 0.0807069 0
Gene484 1.0773906 1.0000000 0.0773906 0
Gene475 1.0760033 1.0000000 0.0760033 0
Gene222 1.0737436 1.0000000 0.0737436 0
Gene202 1.6427858 1.7156525 0.0728667 0
Gene278 1.0726935 1.0000000 0.0726935 0
Gene132 1.0000000 1.0715175 0.0715175 0
Gene468 1.0000000 0.9316385 0.0683615 0
Gene292 1.0627392 1.0000000 0.0627392 0
Gene126 1.0621330 1.0000000 0.0621330 0
Gene239 1.0000000 1.0619006 0.0619006 0
Gene116 1.0000000 1.0596130 0.0596130 0
Gene231 1.0000000 1.0546435 0.0546435 0
Gene118 1.0474613 1.0000000 0.0474613 0
Gene256 1.0406122 1.0000000 0.0406122 0
Gene227 1.0000000 1.0400291 0.0400291 0
Gene455 1.0373562 1.0000000 0.0373562 0
Gene347 1.3211550 1.2865117 0.0346433 0
Gene403 1.0000000 1.0307340 0.0307340 0
Gene309 1.0181823 1.0000000 0.0181823 0
Gene416 1.1885126 1.1734931 0.0150194 0
Gene461 1.0000000 1.0108062 0.0108062 0
Gene291 1.0089752 1.0000000 0.0089752 0
Gene396 1.0000000 1.0071671 0.0071671 0
Gene123 1.0041118 1.0000000 0.0041118 0
Gene434 1.0000000 1.0029095 0.0029095 0

Of the 98 differentially expressed genes, scPCA identified many of the most prominent: 14 of the 20 genes with the largest differences have non-zero loadings. All 20 genes with non-zero entries in scPCA’s first loading vector (column scPCA1 above) are differentially expressed.
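
The bookkeeping behind this comparison can be sketched as follows. The data frame holds the first five rows of the table above, entered by hand. The loading vector `loading_1` and the gene name `Gene999` are made up for illustration and are not output of scPCA.

```r
# First five rows of the table above, entered by hand
de_table <- data.frame(
  Gene = c("Gene327", "Gene201", "Gene346", "Gene240", "Gene128"),
  DEFacGroup1 = c(1.0000000, 2.0011872, 1.0000000, 1.0000000, 1.7802675),
  DEFacGroup3 = c(2.2944560, 1.0000000, 1.9037754, 1.7944498, 1.0000000)
)

# absolute difference between the DE factors of the two target groups
de_table$diff <- abs(de_table$DEFacGroup1 - de_table$DEFacGroup3)

# Hypothetical sparse first loading vector, named by gene (values made up);
# Gene999 stands for a gene that is not differentially expressed
loading_1 <- c(Gene327 = 0.61, Gene201 = 0, Gene346 = -0.42,
               Gene240 = 0, Gene128 = 0.35, Gene999 = 0)

# indicator of a non-zero loading, matching the scPCA1 column of the table
de_table$scPCA1 <- as.integer(loading_1[de_table$Gene] != 0)

# sort by decreasing difference, as in the table
de_table <- de_table[order(de_table$diff, decreasing = TRUE), ]

# how many genes were selected, and how many of those are in the DE table
selected <- names(loading_1)[loading_1 != 0]
n_selected <- length(selected)
n_selected_de <- sum(selected %in% de_table$Gene)
print(de_table)
cat(n_selected_de, "of", n_selected, "selected genes are in the DE table\n")
```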

cPCA (Tuned with Cross-Validation)

5-fold cross-validation was used to tune the contrastive parameter.
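
One plausible reading of this tuning step, sketched in base R plus the `cluster` package: for each candidate contrastive parameter, loadings are fit on the training folds, the held-out target cells are projected onto them and clustered, and the parameter with the highest mean average silhouette width is kept. The toy data, the candidate grid, and the use of k-means with two centers are assumptions made for illustration; this is not the routine that produced the results in this document.

```r
library(cluster)  # silhouette()

set.seed(3)
n <- 300
p <- 10
# toy background and target data (illustrative values only)
background <- matrix(rnorm(n * p), n, p)
background[, 1:2] <- background[, 1:2] * 4
target <- matrix(rnorm(n * p), n, p)
target[, 1:2] <- target[, 1:2] * 4
shifted <- rep(c(FALSE, TRUE), each = n / 2)
target[shifted, 3:4] <- target[shifted, 3:4] + 4

contrasts <- c(0, 1, 10)  # candidate contrastive parameters (assumed grid)
n_folds <- 5
# random fold labels for the target and background cells
target_fold <- sample(rep(seq_len(n_folds), length.out = n))
background_fold <- sample(rep(seq_len(n_folds), length.out = n))

cv_score <- sapply(contrasts, function(contrast) {
  fold_scores <- sapply(seq_len(n_folds), function(k) {
    train_t <- target[target_fold != k, , drop = FALSE]
    train_b <- background[background_fold != k, , drop = FALSE]
    valid_t <- target[target_fold == k, , drop = FALSE]
    # fit cPCA loadings on the training folds
    eig <- eigen(cov(train_t) - contrast * cov(train_b), symmetric = TRUE)
    loadings <- eig$vectors[, 1:2]
    # project held-out target cells, centered with the training means
    proj <- scale(valid_t, center = colMeans(train_t), scale = FALSE) %*% loadings
    # cluster the projection and score it by average silhouette width
    clusters <- kmeans(proj, centers = 2, nstart = 10)$cluster
    mean(silhouette(clusters, dist(proj))[, "sil_width"])
  })
  mean(fold_scores)
})
names(cv_score) <- contrasts

best_contrast <- contrasts[which.max(cv_score)]
print(round(cv_score, 2))
```

In this toy example, a contrastive parameter of zero (ordinary PCA) scores worse than the largest candidate, because the nuisance variation shared with the background hides the two target groups.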

scPCA (Tuned with Cross-Validation)

5-fold cross-validation was used to tune the hyperparameters.

t-SNE

With Initial PCA Step

Without Initial PCA Step

Since both initializations produced qualitatively identical embeddings, t-SNE without the initial PCA step is considered in the manuscript.

UMAP

ZINB-WaVE

SIMLR

## Computing the multiple Kernels.
## Performing network diffiusion.
## Iteration:  1 
## Iteration:  2 
## Iteration:  3 
## Iteration:  4 
## Iteration:  5 
## Iteration:  6 
## Iteration:  7 
## Iteration:  8 
## Iteration:  9 
## Iteration:  10 
## Performing t-SNE.
## Epoch: Iteration # 100  error is:  0.1292803 
## Epoch: Iteration # 200  error is:  0.1265534 
## Epoch: Iteration # 300  error is:  0.125092 
## Epoch: Iteration # 400  error is:  0.1248576 
## Epoch: Iteration # 500  error is:  0.1248467 
## Epoch: Iteration # 600  error is:  0.1248464 
## Epoch: Iteration # 700  error is:  0.1248464 
## Epoch: Iteration # 800  error is:  0.1248464 
## Epoch: Iteration # 900  error is:  0.1248464 
## Epoch: Iteration # 1000  error is:  0.1248464 
## Performing Kmeans.
## Performing t-SNE.
## Epoch: Iteration # 100  error is:  9.700828 
## Epoch: Iteration # 200  error is:  0.1609826 
## Epoch: Iteration # 300  error is:  0.1385839 
## Epoch: Iteration # 400  error is:  0.1361648 
## Epoch: Iteration # 500  error is:  0.1334647 
## Epoch: Iteration # 600  error is:  0.130636 
## Epoch: Iteration # 700  error is:  0.1278329 
## Epoch: Iteration # 800  error is:  0.126011 
## Epoch: Iteration # 900  error is:  0.125326 
## Epoch: Iteration # 1000  error is:  0.1252496

Combined Plots

Average Silhouette Width Plot

Note: The SIMLR results are not included in the figure because their average silhouette widths are misleading. The deceptively good values are a product of SIMLR’s low-dimensional representation of the data, in which the biological clusters are compact and very far apart. The representation nevertheless fails to remove the batch effect.
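
The point of the note can be illustrated with a small sketch that uses `silhouette()` from the `cluster` package. The two-dimensional embedding below is made up: the biological groups are far apart, yet the batches still separate within each group. The helper `avg_sil()` and all coordinates are hypothetical and do not reproduce the SIMLR output.

```r
library(cluster)  # silhouette()

set.seed(4)
# Made-up embedding: 2 biological groups x 2 batches, 50 cells each
group <- rep(c("Group1", "Group3"), each = 100)
batch <- rep(rep(c("Batch1", "Batch2"), each = 50), times = 2)
embedding <- cbind(
  ifelse(group == "Group1", 0, 50) + rnorm(200, sd = 0.5),     # groups far apart
  ifelse(batch == "Batch1", -1.5, 1.5) + rnorm(200, sd = 0.5)  # residual batch shift
)

# Hypothetical helper: average silhouette width for a given labelling
avg_sil <- function(labels, coords) {
  sil <- silhouette(as.integer(factor(labels)), dist(coords))
  mean(sil[, "sil_width"])
}

# average silhouette width with respect to the biological groups
sil_group <- avg_sil(group, embedding)

# average silhouette width with respect to batch, within one group only
in_group1 <- group == "Group1"
sil_batch_group1 <- avg_sil(batch[in_group1], embedding[in_group1, ])

cat("group labels:", round(sil_group, 2),
    " batch labels within Group1:", round(sil_batch_group1, 2), "\n")
```

Here the average silhouette width for the group labels is close to one even though the cells of each group still split cleanly by batch, which is why a high value alone does not show that the batch effect has been removed.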